



|        |                                | Executed on: | Only callable<br>from: |
|--------|--------------------------------|--------------|------------------------|
| device | <pre>float deviceFunc();</pre> | device       | device                 |
| global | <pre>void kernelFunc();</pre>  | device       | host/device            |
| host   | <pre>float hostFunc();</pre>   | host         | host                   |

- Remarks:
  - global defines a kernel function
  - Each '\_\_' consists of two underscore characters
  - A kernel function must return void
  - device and \_\_host\_\_ can be used together





Example for the latter: make cuComplex usable on both device and host

```
struct cuComplex // define a class for complex numbers
Ł
  device host
  cuComplex(float a, float b) : r(a), i(b) {}
   device host
  float magnitude2( void )
  {
    return r * r + i * i;
  }
  // etc. ...
};
```



- An "optimization":
  - The sequence of z<sub>i</sub> can either converge towards a single (complex) value,
  - or it can end up in a cycle of values,
  - or it can be chaotic.
- Idea:
  - Try to recognize such cycles;
     if you realize that a thread is
     caught in a cycle, exit immediately
     (should happen much earlier in most cases)
  - Maintain an array of the k most recent elements of the sequence
- Last time I checked: 4x slower than the brute-force version!



All points here All points here converge converge towards cycle towards of length 2 fixed

All points here converge towards fixed point

G. Zachmann Massively Parallel Algorithms SS May 2014

# Querying the Device for its Capabilities



- How do you know how many threads can be in a block, etc.?
- Query your GPU, like so:

Bremen



## For Your Reference: the Complete Table of the cudaDeviceProp



| DEVICE PROPERTY                     | DESCRIPTION                                                           |  |  |  |  |  |  |  |  |
|-------------------------------------|-----------------------------------------------------------------------|--|--|--|--|--|--|--|--|
| <pre>char name[256];</pre>          | An ASCII string identifying the device (e.g.,<br>"GeForce GTX 280")   |  |  |  |  |  |  |  |  |
| <pre>size_t totalGlobalMem</pre>    | The amount of global memory on the device in bytes                    |  |  |  |  |  |  |  |  |
| <pre>size_t sharedMemPerBlock</pre> | The maximum amount of shared memory a single block may use in bytes   |  |  |  |  |  |  |  |  |
| <pre>int regsPerBlock</pre>         | The number of 32-bit registers available per block                    |  |  |  |  |  |  |  |  |
| <pre>int warpSize</pre>             | The number of threads in a warp                                       |  |  |  |  |  |  |  |  |
| <pre>size_t memPitch</pre>          | The maximum pitch allowed for memory copies in bytes                  |  |  |  |  |  |  |  |  |
| <pre>int maxThreadsPerBlock</pre>   | The maximum number of threads that a block may contain                |  |  |  |  |  |  |  |  |
| <pre>int maxThreadsDim[3]</pre>     | The maximum number of threads allowed along each dimension of a block |  |  |  |  |  |  |  |  |
| <pre>int maxGridSize[3]</pre>       | The number of blocks allowed along each dimension of a grid           |  |  |  |  |  |  |  |  |
| <pre>size_t totalConstMem</pre>     | The amount of available constant memory                               |  |  |  |  |  |  |  |  |



| DEVICE PROPERTY                         | DESCRIPTION                                                                                                                   |
|-----------------------------------------|-------------------------------------------------------------------------------------------------------------------------------|
| int major                               | The major revision of the device's compute capability                                                                         |
| int minor                               | The minor revision of the device's compute capability                                                                         |
| <pre>size_t textureAlignment</pre>      | The device's requirement for texture alignment                                                                                |
| int deviceOverlap                       | A boolean value representing whether the device<br>can simultaneously perform a cudaMemcpy()<br>and kernel execution          |
| int multiProcessorCount                 | The number of multiprocessors on the device                                                                                   |
| <pre>int kernelExecTimeoutEnabled</pre> | A boolean value representing whether there is a runtime limit for kernels executed on this device                             |
| int integrated                          | A boolean value representing whether the device is<br>an integrated GPU (i.e., part of the chipset and not a<br>discrete GPU) |
| int canMapHostMemory                    | A boolean value representing whether the device<br>can map host memory into the CUDA device<br>address space                  |
| int computeMode                         | A value representing the device's computing mode:<br>default, exclusive, or prohibited                                        |
| int maxTexture1D                        | The maximum size supported for 1D textures                                                                                    |



| DEVICE PROPERTY                     | DESCRIPTION                                                                                                                      |
|-------------------------------------|----------------------------------------------------------------------------------------------------------------------------------|
| <pre>int maxTexture2D[2]</pre>      | The maximum dimensions supported for 2D textures                                                                                 |
| <pre>int maxTexture3D[3]</pre>      | The maximum dimensions supported for 3D textures                                                                                 |
| <pre>int maxTexture2DArray[3]</pre> | The maximum dimensions supported for 2D texture arrays                                                                           |
| int concurrentKernels               | A boolean value representing whether the device<br>supports executing multiple kernels within the<br>same context simultaneously |





- Problem: your input, e.g. the vectors, is larger than the maximally allowed size along one dimension?
  - I.e., what if vec\_len > maxThreadsDim[0] \* maxGridSize[0]?
- Solution: partition the problem (color = thread ID)



## Example: Adding Huge Vectors



- Vectors of size 100,000,000 are not uncommon in highperformance computing (HPC) ...
- The thread layout:

Bremen

W

Kernel launch:

addVectors<<< blocks, threads >>>( d\_a, d\_b, d\_c, n );

Index computation in the kernel:

unsigned int tid\_x = blockDim.x \* blockIdx.x + threadIdx.x; unsigned int tid\_y = blockDim.y \* blockIdx.y + threadIdx.y; unsigned int i = tid\_y \* (blockDim.x \* gridDim.x) + tid\_x;





44

### Visualization of this index computation:







- Why is it so important to declare constant variables/instances in C/C++ as const?
- It allows the compiler to ...
  - optimize your program a lot
  - do more type-checking
- Something similar exists in CUDA  $\rightarrow$  constant memory

## Example: a Simple Raytracer



• The ray-tracing principle:

Bromon

lUjj

- 1. Shoot rays from camera through every pixel into scene (primary rays)
- 2. If the rays hits more than one object, then consider only the first hit
- 3. From there, shoot rays to all light sources (*shadow feelers*)
- If a shadow feeler hits another obj → point is in shadow w.r.t. that light source.
   Otherwise, evaluate a lighting model (e.g., Phong [see "Computer graphics"])
- 5. If the hit object is glossy, then shoot reflected rays into scene (secondary rays)  $\rightarrow$  recursion
- 6. If the hit object is transparent, then shoot refracted ray  $\rightarrow$  more recursion





- Bremen
- Simplifications (for now):
  - Only primary rays
  - Camera is at infinity → primary rays are orthogonal to image plane
  - Only spheres
    - They are so easy, every raytracer has them  $\ensuremath{\textcircled{\odot}}$









48

#### The data structures:

```
struct Sphere
{
    Vec3 center; // center of sphere
    float radius;
    Color r, g, b; // color of sphere
    __device__
    bool intersect( const Ray & ray, Hit * hit )
    {
        ...
    }
};
```





#### • The mechanics on the host side:

```
int main( void )
{
   // create host/device bitmaps (see Mandelbrot ex.)
   . . .
   Sphere * h spheres = new Sphere[n spheres];
   // generate spheres, or read from file
   // transfer spheres to device (later)
   // generate image by launching kernel
   // assumption: img size = multiple of block-size!
   dim3 threads (16, 16);
   dim3 blocks( img size/treads.x, img size/treads.y );
   raytrace<<<blocks,threads>>>( d bitmap );
   // display, clean up, and exit
};
```

#### The mechanics on the device side

Bremen

W



```
global
void raytrace( unsigned char * bitmap ) {
   // map thread id to pixel position
   int x = blockIdx.x * blockDim.x + threadIdx.x;
   int y = blockIdx.y * blockDim.y + threadIdx.y;
   int offset = x + y * (gridDim.x * blockDim.x);
  Ray ray(x, y, camera); // generate primary ray
   // check intersection with scene, take closest one
  min dist = INF;
   int hit sphere = MAX INT;
   Hit hit;
   for ( int i = 0; i < n spheres; i ++ ) {</pre>
      if ( intersect(ray, i, & hit) ) {
         if ( hit.dist < min dist ) {</pre>
           min dist = hit.dist; // found a closer hit
           hit sphere = i; // remember sphere; hit info
                                     // is already filled
         }
      }
   }
   // compute color at hit point (if any) and set in bitmap
```



## **Declaration & Transfer**



Since it is constant memory, we declare it as such:

```
const int MAX_NUM_SPHERES 1000;
_____constant___ Sphere c_spheres[MAX_NUM_SPHERES];
```

Transfer now works by a different kind of Memcpy:

# Bremen



- Access of constant memory on the device (i.e., from a kernel) works just like with any globally declared variable
- Example:

```
constant Sphere c spheres[MAX NUM SPHERES];
   device
 bool intersect( const Ray & ray, int s, Hit * hit )
    Vec3 m( c spheres[s].center - ray.orig );
     float q = m*m - c spheres[s].radius*c spheres[s].radius;
    float p = \dots
     solve quadratic( p, q, *t1, *t2 );
     . . .
                                                                 m
(t \cdot \mathbf{d} - \mathbf{m})^2 = r^2 \Rightarrow t^2 - 2t \cdot \mathbf{md} + \mathbf{m}^2 - r^2 = 0
```

# Some Considerations on Constant Memory



- Size of constant memory on the GPU is fairly limited (~48 kB)
  - Check cudaDeviceProp
- Reads from constant memory can be very fast:
  - "Nearby" threads accessing the same constant memory location incur only a single read operation (saves bandwidth by up to factor 16!)
  - Constant memory is cached (i.e., consecutive reads will not incur additional traffic)
- Caveats:

Bromon

lluïj

 If "nearby" threads read from different memory locations
 → traffic jam!





Bromon

lUŰ



Weft varn

- "Nearby threads" = all threads within a warp
- Warp := 32 threads next to each other
  - Each block's set of threads is partitioned into *warps*
  - All threads within a warp are executed on a single streaming multiprocessor (SM) in lockstep
- If all threads in a warp read from the same memory location → one read instruction by SM
- If all threads in a warp read from random memory locations → 32 different read instructions by SM, one after another!



Warp yarn

 In our raytracing example, everything is fine (if there is no bug <sup>(i)</sup>)

For more details: see "Performance with constant memory" on course web page



Bremen

# Overview of a GPU's Architecture





#### Nvidia's Kepler architecture as of 2012 (192 single-precision cores / 15 SM)



Bremen

U



| SMX                             |                                                        |       |                   |      |      |      |                   |       |        |                            |      |                   |         |      |                |      |         |       |     |  |
|---------------------------------|--------------------------------------------------------|-------|-------------------|------|------|------|-------------------|-------|--------|----------------------------|------|-------------------|---------|------|----------------|------|---------|-------|-----|--|
|                                 | War                                                    | p Sch | neduler           |      |      | Wa   | rp Sche           |       | tructi | on Cache<br>Warp Scheduler |      |                   |         |      | Warp Scheduler |      |         |       |     |  |
| Dispatch Dispatch               |                                                        |       | Dispatch Dispatch |      |      |      | Dispatch Dispatch |       |        |                            |      | Dispatch Dispatch |         |      |                |      |         |       |     |  |
| Register File (65,536 x 32-bit) |                                                        |       |                   |      |      |      |                   |       |        |                            |      |                   |         |      |                |      |         |       |     |  |
|                                 |                                                        |       |                   |      |      |      |                   |       |        |                            |      |                   | +       |      |                |      |         |       |     |  |
| Core                            | Core                                                   | Core  | DP Unit           | Core | Core | Core | DP Unit           | LD/ST | SFU    | Core                       | Core | Core              | DP Unit | Core | Core           | Core | DP Unit | LD/ST | SFU |  |
| Core                            | Core                                                   | Core  | DP Unit           | Core | Core | Core | DP Unit           | LD/ST | SFU    | Core                       | Core | Core              | DP Unit | Core | Core           | Core | DP Unit | LD/ST | SFU |  |
| Core                            | Core                                                   | Core  | DP Unit           | Core | Core | Core | DP Unit           | LD/ST | SFU    | Core                       | Core | Core              | DP Unit | Core | Core           | Core | DP Unit | LD/ST | SFU |  |
| Core                            | Core                                                   | Core  | DP Unit           | Core | Core | Core | DP Unit           | LD/ST | SFU    | Core                       | Core | Core              | DP Unit | Core | Core           | Core | DP Unit | LD/ST | SFU |  |
| Core                            | Core                                                   | Core  | DP Unit           | Core | Core | Core | DP Unit           | LD/ST | SFU    | Core                       | Core | Core              | DP Unit | Core | Core           | Core | DP Unit | LD/ST | SFU |  |
| Core                            | Core                                                   | Core  | DP Unit           | Core | Core | Core | DP Unit           | LD/ST | SFU    | Core                       | Core | Core              | DP Unit | Core | Core           | Core | DP Unit | LD/ST | SFU |  |
| Core                            | Core                                                   | Core  | DP Unit           | Core | Core | Core | DP Unit           | LD/ST | SFU    | Core                       | Core | Core              | DP Unit | Core | Core           | Core | DP Unit | LD/ST | SFU |  |
| Core                            | Core                                                   | Core  | DP Unit           | Core | Core | Core | DP Unit           | LD/ST | SFU    | Core                       | Core | Core              | DP Unit | Core | Core           | Core | DP Unit | LD/ST | SFU |  |
| Core                            | Core                                                   | Core  | DP Unit           | Core | Core | Core | DP Unit           | LD/ST | SFU    | Core                       | Core | Core              | DP Unit | Core | Core           | Core | DP Unit | LD/ST | SFU |  |
| Core                            | Core                                                   | Core  | DP Unit           | Core | Core | Core | DP Unit           | LD/ST | SFU    | Core                       | Core | Core              | DP Unit | Core | Core           | Core | DP Unit | LD/ST | SFU |  |
| Core                            | Core                                                   | Core  | DP Unit           | Core | Core | Core | DP Unit           | LD/ST | SFU    | Core                       | Core | Core              | DP Unit | Core | Core           | Core | DP Unit | LD/ST | SFU |  |
| Core                            | Core                                                   | Core  | DP Unit           | Core | Core | Core | DP Unit           | LD/ST | SFU    | Core                       | Core | Core              | DP Unit | Core | Core           | Core | DP Unit | LD/ST | SFU |  |
| Core                            | Core                                                   | Core  | DP Unit           | Core | Core | Core | DP Unit           | LD/ST | SFU    | Core                       | Core | Core              | DP Unit | Core | Core           | Core | DP Unit | LD/ST | SFU |  |
| Core                            | Core                                                   | Core  | DP Unit           | Core | Core | Core | DP Unit           | LD/ST | SFU    | Core                       | Core | Core              | DP Unit | Core | Core           | Core | DP Unit | LD/ST | SFU |  |
| Core                            | Core                                                   | Core  | DP Unit           | Core | Core | Core | DP Unit           | LD/ST | SFU    | Core                       | Core | Core              | DP Unit | Core | Core           | Core | DP Unit | LD/ST | SFU |  |
| Core                            | Core                                                   | Core  | DP Unit           | Core | Core | Core | DP Unit           |       |        |                            |      | Core              | DP Unit | Core | Core           | Core | DP Unit | LD/ST | SFU |  |
|                                 | Interconnect Network<br>64 KB Shared Memory / L1 Cache |       |                   |      |      |      |                   |       |        |                            |      |                   |         |      |                |      |         |       |     |  |
| 48 KB Read-Only Data Cache      |                                                        |       |                   |      |      |      |                   |       |        |                            |      |                   |         |      |                |      |         |       |     |  |
|                                 | Tex                                                    |       | Tex               | :    |      | Tex  |                   | Tex   | ٢      |                            | Tex  |                   | Тех     | c .  |                | Tex  |         | Tex   | :   |  |
|                                 | Tex                                                    |       | Tex               | :    |      | Tex  |                   | Tex   | (      |                            | Tex  |                   | Tex Tex |      |                |      |         | Tex   |     |  |

# Thread Divergence Revisited



- This execution of threads in *lockstep fashion on one SM* (think SIMD) is the reason, why thread divergence is so bad
- Thread divergence can occur at each occurrence of if-thenelse, while, for, and switch (all flow control statements)

• Example:







The more complex your control flow graph (this is called cyclometric complexity), the more thread divergence can occur!



# Consequences for You as an Algorithm Designer / Programmer

Bremen

W



- Try to devise algorithms that consist of kernels with very low cyclometric complexity
- Avoid recursion (would probably further increase thread divergence)
  - The other reason is that we would need one stack per thread
  - If your algorithm heavily relies on recursion, then it may not be well suited for massive (data) parallelism!